Visualization of composite plots in R using a programmatic approach and smplot2
Seung Hyun Min
Abstract
In psychology and human neuroscience, the practice of creating multiple subplots and combining them into one composite plot has become common because research has become more multifaceted and sophisticated. In the last decade, the number of methods and tools for data visualization has surged. For example, R, a programming language, has become widely used in part due to ggplot2, a free, open-source and intuitive plotting library. However, despite its strength and ubiquity, ggplot2 has some built-in restrictions that are most noticeable when one creates a composite plot, which involves a plotting process with complex, discrete steps. For instance, the current pipeline for creating a composite plot in ggplot2 is tedious, non-linear and repetitive for both beginners and experienced users, forcing them to opt for ad-hoc workflows, write computationally inefficient code, and resort, out of necessity, to practices that go against the principles of open science. To address these issues, I introduce smplot2, an open-source R package that integrates the practices of data visualization in ggplot2 and the programmatic approach of plotting by enhancing the library’s flexibility. The package has the potential to empower users by allowing them to create more customizable, dynamic and expressive figures, promoting the reproducibility of complex visualization routines, and linearizing the workflow of visualizing a composite plot.
The Rise of ggplot2
With modern software tools, there has been a surge in the number of methods and tools through which researchers and clinicians can perform data visualization, an important skill in scientific research. For instance, R, a programming language1, has grown rapidly in use for statistical data visualization in the last 15 years in part due to ggplot2, a plotting library that was introduced in 2009 by Hadley Wickham2. Its citation count has towered over that of Python’s matplotlib (see Figure 1), an extensive and flexible but challenging low-level plotting library that was first introduced by John Hunter in 20073. The reason for the recent rise of ggplot2 is that the library is free, open-source and intuitive for users. Layers of graphics can be added sequentially on a plotting space to produce complex plots. The details of the philosophy behind ggplot2, which is better known as the “grammar of graphics”, are well explained in a tutorial in this journal4. In brief, as long as users know how to add a layer of points, a layer of lines, and other specific layers sequentially using ggplot2’s declarative syntax, they will be able to plot their data in both simple and complex fashions with a high level of customization without applying the programmatic approach, such as creating loops and functions5. Furthermore, due to the active community of users, there exist diverse third-party R packages6–8 that complement ggplot2 and provide shortcut functions for plotting, allowing users to plot data in wide-ranging ways with just a few lines of code. These factors have made R, rather than Python, a preferable tool for data visualization for researchers and clinicians across disciplines and levels of experience.
Figure 1. Year-to-year citation count of two major plotting libraries in R and Python: ggplot2 and matplotlib. The year 2024 shows a partial count of the citations. Each point denotes the timepoint when the authors published an article regarding their software. Citation counts were retrieved from Google Scholar on April 4, 2024.
Built-in Restrictions of ggplot2
Figure 2. A comparison of the standard routines for subplotting between matplotlib from Python and ggplot2 from R. In Python, it is standard to generate multiple panels using an iterative or functional programming approach. After the plots have been combined, the specific aesthetics of the composite plot, such as the number of rows and columns, as well as the common legend, x-axis and y-axis labels, can be adjusted without modifying the individual plots. Furthermore, matplotlib provides full flexibility for text, shape and other types of annotations to be added on the combined image. So, the process of visualizing a composite plot is linear in Python’s matplotlib, with clear starting and ending points. However, in R’s ggplot2, the process often requires users to go back and forth between the stages of creating individual plots and combining them. So, users are encouraged to plot one graph at a time and then combine all plots together as late as possible because of ggplot2’s limited flexibility for the aesthetics of a composite figure. The goal of smplot2 is to simplify the process of complex data visualization by resolving these issues.
In psychology and human neuroscience, the practice of creating multiple subplots and combining them into one composite plot is common9. This method of data visualization is known as subplotting. In the last few decades, it has become more widespread as research has become increasingly sophisticated, as demonstrated by the recent trend of including more variables and conditions in experiments, conducting collaborations with other laboratories if possible, and implementing multiple methodologies for data collection and analysis10. These, in turn, create datasets with complicated structures, thereby requiring complex forms of data visualizations. However, as a high-level plotting library - which does not require users to plot each detail of the plot separately - ggplot2 has some built-in restrictions that are most noticeable when one creates a composite plot.
Currently, creating a composite plot in ggplot2 is complex for several reasons. First, although ggplot2 allows for flexible customization of individual plots with concise code, it is not compatible with the most well-known programmatic approach - iteration using a for loop - unless unorthodox methods are used. Consequently, users unfamiliar with proper methods may struggle with applying iterations in ggplot2.
Second, ggplot2 provides limited options for subplotting. A
typical ggplot2 operation returns a single plot object that can
be easily manipulated or stored. While facet_wrap() and
facet_grid() support data allocation into multiple subplots
(facets) within a single plot object, ggplot2 limits aesthetic
customization of these subplots within a facet plot. For example,
assigning subsets of data to different subplots using multiple or
hierarchical variables, or applying dynamic color schemes for each
variable level is challenging. To circumvent this, users might need to
restructure their data frames to visualize data as intended, but this
only affects components that map data variables to aesthetics, not
elements like axis limits, background themes, or coordinate systems.
For plotting unique visual elements that have no relation to the data across panels, a third method has been used. It involves combining separate ggplot2 plot objects into one composite figure using libraries such as cowplot11 and patchwork12. This method enables users to draw a composite plot flexibly but requires them to code each subplot separately (see Pseudocode 1), resulting in repetitive scripts albeit with minor differences (compare Plot 1 and Plot 8 in Pseudocode 1). Also, this approach restricts the aesthetic control of the composite figure, such as its layout, annotations (including legends) and marginal space (see Figure 2), further encouraging users to code each subplot individually.
Taken together, although ggplot2 enjoys a widespread user base, users who visualize a composite plot have had to write repetitive scripts, seek third-party packages, or resort to a vector graphics editor, straying from the recommended practices for scientific reproducibility.
# Pseudocode 1: Composite plot in Figure 2 using ggplot2
# Generate each plot separately
plot1 <- ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) +
... +
<THEME_FUNCTION>(<LESS SPACING>) + # unique for this panel
<THEME_FUNCTION>(<REMOVE X-TICKS>) + # unique for this panel
<ANNOTATE_FUNCTION>(<TEXT, SHAPE ANNOTATIONS>)
# Repeat for plot2, plot3, .... , plot7
plot8 <- ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) +
... +
<THEME_FUNCTION>(<LESS SPACING>) +
<THEME_FUNCTION>(<REMOVE X-TICKS>) + # unique for this panel
<THEME_FUNCTION>(<REMOVE Y-TICKS>) + # unique for this panel
<ANNOTATE_FUNCTION>(<TEXT, SHAPE ANNOTATIONS>)
library(<THIRD_PARTY_PACKAGES>)
multi_plot <- <COMBINE_FUNCTION>(plot1, plot2, plot3, plot4, plot5,
plot6, plot7, plot8)
# Check if the multi_plot looks OK. If not, revise the code that generates each plot.
On the other hand, the workflow is much simpler and computationally efficient in Python’s matplotlib when it comes to building a composite plot. For example, users are often encouraged to create constructs, such as loops and functions, by using the programmatic approach to plot their data, especially during subplotting. This programmatic method ensures full flexibility for aesthetics and plotting because users can then allocate subsets of data to unique panels using any number and combination of variables, as well as dynamically control the aesthetics, such as color. Here’s a pseudocode that illustrates this point in Python’s matplotlib (Pseudocode 2).
# Pseudocode 2: Composite plot in Figure 2 using Python's matplotlib
fig, ax = plt.subplots(nrows = 2, ncols = 4, sharex = True, sharey = True)
for <PLOT INDEX> in range(<NUMBER OF PLOTS>): # 8 iterations
    ax.flat[<PLOT INDEX>].<PLOTTING_FUNCTIONS>(<DATA>, <COLOR>) # ax is a 2 x 4 array, so index it flat
fig.subplots_adjust(hspace = 0.2, wspace = 0.1) # more spacing between rows than columns
fig.text(x = 0.5, y = 0.95, s = 'Title of a Composite Figure')
The matplotlib pseudocode that generates the composite plot
(Figure 2) is markedly simpler. The first line determines the structure
of the composite figure. The data themselves are plotted within a
for loop at each panel, iterating for the length of the total
number of subplots (eight total, see Figure 2). Due to the programmatic
approach, the color and other aesthetics in the plot for each panel can
be different, yielding more flexibility. In addition, although the code
that generates the panels is identical, the panels actually
look different from one another as some have y-axis ticks or
x-axis ticks (or both; see Figure 2) because the layout of the combined
figure has already been established in the beginning. Furthermore, the
aesthetics of the composite figure can be controlled, such as the amount
of blank space between panels (hspace and
wspace in Pseudocode 2). Finally, after the panels have
been combined, a common legend and annotations (texts, shapes, points,
patches, lines, etc) can be added anywhere in the composite figure. In
short, matplotlib offers flexibility both at the level of each
panel and the composite figure, making it possible for the workflow of
generating a composite plot to be linear, with a clear start
and end. This versatility of control and structured workflow
for performing complex visualizations are missing in
ggplot2.
A Need for a Solution: smplot2
Although the “grammar of graphics” interface reduces the code length for a standalone figure and has been responsible for the soaring popularity of ggplot2, combining multiple ggplot2 outputs into one composite figure and achieving full control over aesthetics remains a challenge. It has required users to write lengthy code for each subplot, the building block of a composite figure, and to combine the subplots into one composite figure as late as possible. This is concerning given that ggplot2 has been widely used (see Figure 1) and that research routines in psychology and human neuroscience have become more complex and sophisticated.
To resolve these issues, I introduce smplot2, an open-source R package that integrates the practice of data visualization in ggplot2 and the programmatic approach of plotting by giving users an equal level of control over both individual subplots and the composite plot. It has over 40 functions at the time of writing (see 300+ examples in https://smin95.github.io/dataviz) but for brevity, in this tutorial, I will primarily discuss how it can linearize the workflow of visualizing elegant composite plots using a programmatic approach while maximizing the flexibility for aesthetics in ggplot2. All examples here are created with the aesthetic defaults of the smplot2 package, which are clean and appropriate for research articles across various fields and data structures. The functions of smplot2 have been optimized for subplotting to maximize the visibility of data in a composite plot by controlling the extent of blank spacing, scaling and the relative text size. I hope that this tutorial can empower readers to perform complex and expressive data visualizations of a composite plot using a structured workflow.
Aim and structure of the tutorial
The aim of this tutorial is not to reiterate the contents of the package’s documentation from the web in its entirety or to introduce ggplot25,13,14. Instead, it aims to present a new workflow for the visualization of a composite plot in ggplot2 with a programmatic approach and the smplot2 package. In the first section, I will briefly introduce some of the visualization functions of smplot2, such as its background themes, which improve aesthetics for subplotting. Then, in the next three sections, I will demonstrate with three examples how we can produce subplots in ggplot2 iteratively and then combine them into a composite plot using a linear process (similar to that shown in Figure 2 for Python’s matplotlib). The examples will become increasingly sophisticated to demonstrate that there is no limit to how users can create and customize composite figures. The tutorial is summarized in Table 1 at the end.
Target audience
The tutorial assumes that readers have some basic knowledge of R and
ggplot2, and some experience with working with data frames
using functions such as filter(), group_by(),
%>% and summarise(). They do not need to be
familiar with concepts of programming or be fluent in any other
programming languages, such as Python. Although the examples in this
tutorial use randomly generated data based on human vision studies,
readers across disciplines will be able to adapt the codes/examples in
this tutorial easily for their own purpose.
Readers who have not used ggplot2 and R should read Chapters 2-3 of the package’s documentation webpage (https://smin95.github.io/dataviz) before starting this tutorial. The chapters provide a step-by-step guide on how to install RStudio and use ggplot2. Those who have not worked with data frames in R are recommended to read the tutorial by Nordmann et al.4 or the early sections of Chapter 7 of the documentation webpage (sections 7.1 & 7.2). Completing these two prerequisites for the tutorial would take about 2-3 hours.
Installation requirements for this tutorial
Two packages, tidyverse15 and smplot2, should be installed from the Comprehensive R Archive Network (CRAN) to complete the tutorial. The tidyverse package is a suite of multiple packages, such as ggplot2 (for plotting and saving visualizations), dplyr (for working with data frames), and readr (for reading external data files).
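The installation step can be sketched as follows. This is a minimal example rather than a prescribed procedure; the object names pkgs and missing_pkgs are illustrative, and the interactive() guard is an added precaution (not part of the tutorial) so that nothing is installed when the snippet runs as a batch script.

```r
# Packages named in this tutorial
pkgs <- c("tidyverse", "smplot2")

# Identify which of them are not yet installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Install from CRAN only when needed and only in an interactive session
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```

Calling install.packages(c("tidyverse", "smplot2")) directly in the console achieves the same result without the checks.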
Open science practices
With more than 300 examples, smplot2 has been documented online in detail (https://smin95.github.io/dataviz) with 12 chapters devoted to the package at the time of writing. The documentation webpage was created using the bookdown package for reproducibility (source code at https://www.github.com/smin95/dataviz). The code in the tutorial is posted online (https://www.smin95.com/smplot2doc.html).
Introduction to smplot2 - Background Themes
First and foremost, we should load the two packages to memory.
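Loading the two packages named in the installation requirements would typically look like this (a minimal sketch, assuming both packages are already installed):

```r
library(tidyverse) # attaches ggplot2, dplyr, readr and other core packages
library(smplot2)   # plotting and thematic functions used in this tutorial
```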
smplot2 offers various plotting and thematic functions. In this section, only the thematic functions will be discussed. For more information about the plotting functions (raincloud plot, slope chart, forest plot, Bland-Altman plot, etc), please see examples in Chapters 3-6 from the documentation webpage.
In this example, a randomly generated data set will be used as shown below:
set.seed(2022) # Set seed for generating random data
df <- data.frame(
Subject = rep(paste0('S', 1:16), times = 3),
Value = c(
rnorm(n = 16, mean = 0, sd = 1.5), # Day 1
rnorm(n = 16, mean = 5, sd = 1.7), # Day 2
rnorm(n = 16, mean = 10, sd = 2.0) # Day 3
),
Time = rep(paste("Day", 1:3), each = 16)
)
head(df)
## Subject Value Time
## 1 S1 1.3502130 Day 1
## 2 S2 -1.7600187 Day 1
## 3 S3 -1.3462280 Day 1
## 4 S4 -2.1667521 Day 1
## 5 S5 -0.4965204 Day 1
## 6 S6 -4.3509435 Day 1
The data frame is stored in the object df. Each row of
the df object represents a single observation from each
Subject and Time. The column
Subject stores identifiers for all subjects in the form of
character strings; the column Value stores the dependent
variable, which is the value of interest in this example; the column
Time contains all identifiers for the independent variable,
which has three levels, in the form of character strings:
Day 1, Day 2, and Day 3.
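The structure described above can be checked with a few base R calls. This is a small sketch that rebuilds the same data frame so that it runs on its own; the object names obs_per_day and n_subjects are illustrative.

```r
# Rebuild the example data frame so this snippet is self-contained
set.seed(2022)
df <- data.frame(
  Subject = rep(paste0("S", 1:16), times = 3),
  Value = c(
    rnorm(n = 16, mean = 0, sd = 1.5),  # Day 1
    rnorm(n = 16, mean = 5, sd = 1.7),  # Day 2
    rnorm(n = 16, mean = 10, sd = 2.0)  # Day 3
  ),
  Time = rep(paste("Day", 1:3), each = 16)
)

str(df)                                  # column names and data types
obs_per_day <- table(df$Time)            # number of rows per level of Time
n_subjects <- length(unique(df$Subject)) # number of distinct subjects
obs_per_day
n_subjects
```

Each level of Time should hold one row per subject, which confirms that every row is a single observation for one Subject at one Time.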
Figure 3. (A) A default raincloud plot with a background theme that has major horizontal grids. (B) A raincloud plot with a classic theme. (C) A raincloud plot with a minimal theme (i.e., no grids).
In this section, raincloud plots are drawn using the function
sm_raincloud() to present the different themes (see Figure
3). Each subject’s data (as points), the sample’s distribution (in
violin plots), median and first and third quartiles (in boxplots) are
typically displayed in a raincloud plot. A black dot below the boxplot
for Day 1 denotes that an outlier is present. Details of this function
are described in Chapter 6 of the documentation webpage. Here,
the data from the Value column are plotted as a function of
Time. We can map the aesthetics (i.e., fill)
within the ggplot() function so that each unique color
represents each level of Time (see the codes for Figure
3).
Each panel of Figure 3 shows a different background theme. The theme
with major horizontal grids is used in Figure 3A by default because
sm_raincloud() implements the theme automatically. However,
this can be overwritten if users add another theme function modularly to
a ggplot2 object (ex. sm_classic() is added to
generate Figure 3B). These thematic functions provide minimalistic
aesthetics, and have borders and legends
arguments. The former, if set to borders = TRUE, will print
the border of the panel. The latter, if set to
legends = TRUE, will print the legend of the standalone
plot. There are several background themes in the package:
sm_hgrid() is a theme with horizontal major grids (Figure 3A).
sm_vgrid() is a theme with vertical major grids.
sm_hvgrid_minor() is a theme with horizontal and vertical grids (major and minor).
sm_classic() is a theme with a standard y-axis on the left side and x-axis at the bottom (Figure 3B).
sm_minimal() is a theme with no grids (Figure 3C).
# Figure 3A - Major horizontal grids
ggplot(data = df, mapping = aes(x = Time, y = Value, fill = Time)) +
sm_raincloud() + # Default
scale_fill_manual(values = sm_color('blue','darkred','viridian'))
# Figure 3B - Classic theme
ggplot(data = df, mapping = aes(x = Time, y = Value, fill = Time)) +
sm_raincloud(sep_level = 3) + # Separates the graphical components
sm_classic() +
scale_fill_manual(values = sm_color('blue','darkred','viridian'))
# Figure 3C - White background with no grids
ggplot(data = df, mapping = aes(x = Time, y = Value, fill = Time)) +
sm_raincloud(which_side = 'l') + # Changes the raincloud plot's facing direction
sm_minimal() +
scale_fill_manual(values = sm_color('blue','darkred','viridian'))
The themes have been developed to optimize the discernibility of each
plotting feature (ex. relative text size, blank spacing, etc) even when
multiple subplots are combined into one composite figure. Here, for
instance, the three examples of the raincloud plot have been combined
into one figure using the function sm_put_together() (code
not shown), which we will discuss extensively in the later
sections. To foreshadow, sm_put_together(), which combines
subplots into a composite figure (as described later in the text),
essentially interacts with these themes and other functions to optimize
the aesthetics so that each plotting feature is discernible in a
multi-panel, composite figure. For this reason, these functions are
discussed before we create a composite plot. The hex codes of the three
colors in Figure 3 are from the sm_color() function, which
primarily archives colors with high visibility, an important factor when
one creates a composite plot.
In the next three examples where composite plots will be created, we
will strictly use these thematic functions (ex. sm_hgrid()
and sm_minimal()) and geom_*() functions to
plot data in the form of lines and points (i.e., simplest type of data
visualization) so that users across all levels of experience and
background can understand the codes without knowing the plotting
functions of smplot2.
Example 1: Subplotting Data Using One Variable
Simulated dataset
Amblyopia is a visual deficit with origins in the primary visual cortex16. The simulated data here represent visual health in individuals with amblyopia and normal vision at various experimental conditions and types of visual stimuli. They will be used throughout the rest of the tutorial.
df2 <- read_csv('https://www.smin95.com/amblyopia_random2.csv')
df2_amb <- df2 %>% filter(Group == 'Amblyopia') %>%
mutate(logSF = log2(SF)) %>%
mutate(Condition = factor(Condition, levels = c('One','Two','Three')))
head(df2_amb)
## # A tibble: 6 × 6
## Subject absBP SF Group Condition logSF
## <chr> <dbl> <dbl> <chr> <fct> <dbl>
## 1 A1 0.168 0.5 Amblyopia Three -1
## 2 A1 1.37 1 Amblyopia Three 0
## 3 A1 1.29 2 Amblyopia Three 1
## 4 A1 2.67 4 Amblyopia Three 2
## 5 A1 0.0111 8 Amblyopia Three 3
## 6 A2 0.0136 0.5 Amblyopia Three -1
This dataset should be loaded to memory using the code above.
Throughout the tutorial, we will use the %>% operator, which
is known as the pipe. It allows the data frame from the
previous operation of data transformation to be carried over or
piped to the next operation. This reduces the burden of users
from supplying the input data frame for each operation. To begin with,
we extract the data from the df2 object only for
individuals in the Group == 'Amblyopia' by using
filter(). Next, the continuous variable SF,
which is an acronym for spatial frequency, is converted into log2 scale
using mutate(), which creates another column
(logSF) based on the existing column (SF) in
the data frame df2. Through this logarithmic operation in
mutate(), a new column logSF is created, with
equal spacing along its scale. Then, the data type of the
Condition column is changed using mutate(); it
is initially a string but mutate() converts it into a
factor and re-orders the levels of the variable into numerical
order ('One'-'Two'-'Three').
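The effect of the two transformations can be seen on a few toy values in base R. This is a minimal sketch; the vectors sf and condition are illustrative stand-ins for the SF and Condition columns.

```r
# Spatial frequencies that double at each step
sf <- c(0.5, 1, 2, 4, 8)

# On a log2 scale the same values are equally spaced
log_sf <- log2(sf)
spacing <- diff(log_sf) # difference between neighbouring values
log_sf
spacing

# Character strings converted to a factor are ordered alphabetically by default
condition <- c("Three", "One", "Two", "One")
default_levels <- levels(factor(condition))

# Supplying `levels` fixes the order to One, Two, Three
cond_ordered <- factor(condition, levels = c("One", "Two", "Three"))
default_levels
levels(cond_ordered)
```

The explicit level order matters later because ggplot2 assigns shapes and colors to the levels of a factor in order.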
After the data transformations, a newly formed data frame is stored in
the object df2_amb, whose first six rows are displayed in
the tutorial. Readers can double-check by comparing their own printed
values of the df2_amb object to those in the tutorial.
The column absBP, which is short for absolute balance
point, contains data of the dependent variable (y-axis). It is a measure
of visual health. The higher the value, the worse the vision.
In this example, we will allocate data to each panel by using the
variable Subject (i.e., subplotting with one variable). In
other words, each panel will display the data of each subject
(absBP as a function of logSF).
lapply()
lapply() is one of the apply functions from
base R. It applies a function to a list or a vector, and returns a
list with the same length as the input. A list is a data structure
that can contain different types of elements, such as
strings, numbers and lists. Essentially, lapply() is
similar to how a for loop works but it returns a list
as output. Since ggplot2 objects can be stored in a
list but not in other types of vectors, we will use
lapply() to perform iterations. Pseudocode 3 shows the
basic syntax of lapply().
The input can be either a list or a vector. If the input has a length of five (i.e., five elements), then the function will be run five times, and an output list that has a length of five will be returned. In our case, the function will be plotting the data, with specific mapping and aesthetics, and generate ggplot2 objects. We will plot each of the nine individuals’ data, so we will run the function nine times (i.e., nine iterations). The returning output should therefore have a length of nine, each of which is a plot. Additional arguments can be passed to the function but in our tutorial there will not be any additional arguments, so these can be ignored.
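The behaviour described above can be tried on a toy input before any plotting is involved. This is a minimal base R sketch; the names inputs and squared are illustrative.

```r
# An input vector with five elements
inputs <- c(2, 4, 6, 8, 10)

# The anonymous function runs once per element; lapply() collects the results in a list
squared <- lapply(inputs, function(x) {
  x^2
})

length(squared) # same length as the input
class(squared)  # always a list
squared[[3]]    # result of the third iteration
```

In the plotting examples, the body of the anonymous function returns a ggplot2 object instead of a number, so the output is a list of plots.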
First, we create an input object that specifies the nine subjects in
the Amblyopia Group. The data frame df2
contains a column of identifiers for subjects. We see that these
subjects have identifiers as A1 to A9. This
can be recreated as the vector subj_list by concatenating the
string 'A' with the sequence of numbers 1:9 (1 to 9
in integers) using the function paste0(). So, the elements
within the vector subj_list will contain subject
identifiers that are also found in the Subject column of
the data frame df2.
In the lapply() structure, through which we will plot
the data of each subject on a separate panel, there should be
two parts. This structure will be used throughout the rest of the
tutorial, and it is widely applicable across designs and
disciplines:
The first part filters data using the index of the iteration. Here, iSubj is the index of the iteration, and it starts from 1 and ends at 9 as specified by 1:length(subj_list). During each iteration, the index is used to retrieve the element of the object subj_list, ex. A9 from subj_list[iSubj] when iSubj = 9. The extracted subject identifier is then used to filter for each subject’s data before plotting begins (ex. filter(Subject == subj_list[iSubj])). The filtered data is stored in the object subj_data, which will be used by the subsequent plotting functions; so, plotting will only use the filtered data from each subject.
The second part of the lapply() function plots the filtered data. The variables are mapped to aesthetics, and the appearance of the plot is customized using functions from ggplot2. Here, the specifications are set so that the data frame to be used is subj_data, that x is logSF, and that y is absBP, which is the outcome of interest in this simulated dataset. Moreover, the aesthetic group is mapped to the variable Condition of the data frame subj_data, so that the points are connected with lines for each condition. Also, within the ggplot() function, the shape, the filling color of the points (fill), as well as the color of the lines are all set to be unique for each condition. In other words, all three conditions will be plotted in one panel of each subject at once.
In this example, each condition is coded to a unique shape with the
function scale_shape_manual(). The first condition is coded
to the shape value of 21 (circle with borders), the second to the value
of 22 (square with borders), and the third to the value of 23 (diamond
with borders). Since these shapes have borders, the argument
fill determines their filling color, and color
determines their border color, which is set to transparent.
Similarly, with scale_fill_manual(), each condition is
color coded to a specific filling color. The colors are specified in the
object cList, which is defined outside the
lapply() function using the sm_palette()
function that returns three colors here. This function returns default
colors of the package and is equivalent to sm_color(),
except that it takes the number of colors as input instead of character
strings specifying the colors. Users are encouraged to find their own
color schemes from other packages, such as RColorBrewer and
viridis, available in the R ecosystem.
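As one hedged illustration of swapping in a different color scheme without extra packages, base R's hcl.colors() (available in R 3.6.0 and later) can generate a three-color vector. The name cList_alt is hypothetical, and this palette is a stand-in rather than a recommendation from the package.

```r
# Three colors from the "Viridis" palette in base R's grDevices
cList_alt <- hcl.colors(n = 3, palette = "Viridis")
cList_alt # a character vector of hex color codes

# Such a vector could be passed wherever cList is used, e.g.
# scale_fill_manual(values = cList_alt)
```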
subj_list <- paste0("A", 1:9) # 9 subjects
cList <- sm_palette(3) # Three colors from smplot2 (defaults)
indv_plots <- lapply(1:length(subj_list), function(iSubj) {
# First part: Filter for each subject's data during each iteration
subj_data <- df2_amb %>%
filter(Subject == subj_list[iSubj])
# Second part: Plot each subject's data
ggplot(data = subj_data, aes(
x = logSF, y = absBP, group = Condition,
shape = Condition, fill = Condition, color = Condition
)) +
geom_line(linewidth = 1) +
geom_point(size = 5, color = "transparent") +
scale_color_manual(values = cList) +
scale_fill_manual(values = cList) +
scale_shape_manual(values = c(21, 22, 23)) +
sm_hgrid() +
scale_y_continuous(limits = c(0, 3)) +
scale_x_continuous(
limits = c(-1.3, 3.3),
labels = c(0.5, 1, 2, 4, 8)
)
})
As lapply() runs the function at each iteration, a plot will be
generated. Each plot will get stored in the object
indv_plots, which is a list. Since
length(subj_list) is 9 and the input for the
lapply() function is an integer from 1 to 9
(1:length(subj_list)), there will be nine iterations, and
hence, nine plots will be generated.
When one codes for multiple subplots in a lapply()
function (second part of the structure), it is important to make the
limits of x- and y-axes identical. In the lapply()
function, both have been specified using
scale_y_continuous() (y-axis: 0 to 3) and
scale_x_continuous() (x-axis: -1.3 to 3.3). If there is no
specification of the limits, each plot will have its own limit based on
each subject’s data. Also, notice that although we plot data as a
function of the variable logSF, which has values of -1, 0,
1, 2 and 3, the tick labels of the x-axis are displayed as 0.5, 1, 2, 4
and 8. This is because the labels argument has been
supplied with these specifications in the function
scale_x_continuous(). Essentially, these inputs mask over
the true values of the x ticks on the plot. This is a common method of
plotting data in human vision studies because the visual system has been
known to process information non-linearly17, and it is specific to the
examples in the tutorial.
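The relationship between the plotted positions and the displayed labels can be written out in base R. This is a small sketch; the names x_breaks and x_labels are illustrative, and passing breaks explicitly is an added suggestion rather than part of the tutorial's code.

```r
# Tick positions on the log2 axis
x_breaks <- -1:3

# Labels are the corresponding spatial frequencies in original units
x_labels <- 2^x_breaks
x_labels

# ggplot2 expects one label per break, so the two vectors must match in length
length(x_breaks) == length(x_labels)

# They could then be supplied together, e.g.
# scale_x_continuous(limits = c(-1.3, 3.3), breaks = x_breaks, labels = x_labels)
```

Stating the breaks alongside the labels keeps the pairing explicit if the axis limits are changed later.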
To display a plot from the object indv_plots, users can
type the name of the list indv_plots in the console, or
subset for one specific subject’s plot using double brackets (ex.
indv_plots[[3]]). This individual plot still has ticks for
x- and y-axes as well as their labels. However, these will be removed
automatically later or resized during the generation of a composite
plot.
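The difference between double and single brackets when subsetting a list can be seen with a toy list. This is a minimal sketch in which character strings stand in for the ggplot2 objects; plots_demo is a hypothetical name.

```r
# Character strings stand in for plot objects
plots_demo <- list("plot for A1", "plot for A2", "plot for A3")

# Double brackets return the element itself
third <- plots_demo[[3]]
class(third)

# Single brackets return a list of length one that contains the element
third_as_list <- plots_demo[3]
class(third_as_list)
length(third_as_list)
```

For a list of plots, the double-bracket form is the one that returns a single plot object that can be displayed.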
Next, we define the title, as well as common x- and y-axes labels of
the composite figure that we will create. As their names suggest,
sm_common_title() sets the title of the combined figure,
sm_common_xlabel() sets the common x-axis label of the
combined plot, and sm_common_ylabel() sets the common
y-axis label of the combined figure. In these three functions,
x and y control the location of the texts.
Their defaults are set to x = 0.5, y = 0.5 which refers to
the center origin of their respective areas (x and
y do not refer to the coordinate relative to the combined
figure).
# Figure 4 - Set the title and axis labels of the composite figure
title <- sm_common_title("Individual data (subplotting with one factor)", x = 0.55, y = 0.52)
xlabel <- sm_common_xlabel("Spatial frequency (c/deg)", x = 0.52)
ylabel <- sm_common_ylabel("Visual deficit")
Notice that this process is highly similar to what is often used in
Python’s matplotlib, where fig refers to the object that
stores the combined figure (see Pseudocode 4).
# Pseudocode 4: Titles and axis labels in Python's matplotlib (Figure 4)
fig.suptitle('Individual data', x = x, y = y)
fig.text(x, y, 'Spatial frequency (c/deg)') # x-axis label
fig.text(x, y, 'Visual deficit', rotation = 90) # y-axis label
Figure 4. A composite plot with three columns and three rows. Each panel is allotted to one subject’s data across all three conditions. The legend is absent in this figure.
In Python’s matplotlib, the text labels and the title get
added to fig (the composite plot) using the object-oriented
approach. Here, we will use sm_put_together(), which is
essentially a layout function that creates a composite figure from
individual plots. The output from sm_put_together() must be
stored in an output object (ex. plots_tgd). Three inputs
must be provided to run sm_put_together(): 1) the list
object, which stores all plots (ex. all_plots argument =
indv_plots), 2) the number of columns (ex.
ncol argument = 3), and 3) the number of rows
(ex. nrow argument = 3). There are optional
arguments as well, such as title, xlabel and
ylabel. For instance, if title is not supplied
as input, then no space will be allocated for a common title in the
composite figure. Here, we will supply title in
sm_put_together(). In addition, the extent of blank spacing
(i.e., margin) in both width (wmargin) and height
(hmargin) can be adjusted (see Figure 2). In this example,
they are set as negative values to minimize the spacing between
subpanels.
# Figure 4 - Combine subplots into a specified layout
plots_tgd <- sm_put_together(
all_plots = indv_plots, title = title, xlabel = xlabel,
ylabel = ylabel, ncol = 3, nrow = 3,
wmargin = -4.5, hmargin = -4.5
)
We can save the figure using ggsave() (see Figure 4). We
supply the name of the image file as a string
('together1.png') as the function’s first input, and the
object of the composite figure (plots_tgd) as its second
input. This makes the function save the object plots_tgd
as together1.png in the working directory. Also, we set the
dimensions of the image so that it has a height and a
width of 9 inches.
# Figure 4 - Save the composite output as a PNG image
ggsave("together1.png", plots_tgd,
width = 9, # inches
height = 9
)
Immediately in Figure 4, we can notice that the function
sm_put_together() has removed extraneous tick labels and
titles from both axes in the inner panels. This was possible because we
had provided the layout of the composite figure that was to be
constructed in sm_put_together(), which is similar to how
matplotlib controls the layout of the figure (see Pseudocode
2). Although the function by default removes the extraneous ticks in
inner panels (remove_ticks = 'some' in
sm_put_together()), this option can be overridden so that
all ticks are kept (remove_ticks = 'none') or removed
(remove_ticks = 'all'). The order of the subpanels follows
that of the plots generated by the lapply() code
chunk. In this case, we set the layout to be
(ncol = 3, nrow = 3), so axis ticks in the 2nd, 3rd, 5th
and 6th panels in the composite figure are removed.
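The panel numbers above can be reproduced with a small sketch. It assumes that panels are filled row by row and that only the leftmost column and the bottom row keep their ticks. panel_positions() is a hypothetical helper written for illustration; it is not part of smplot2 and does not show how the package works internally.

```r
# Hypothetical helper (not part of smplot2): for a grid filled row by row,
# report which panels sit in the leftmost column or in the bottom row.
panel_positions <- function(n_panels, ncol, nrow) {
  idx <- seq_len(n_panels)
  row <- ceiling(idx / ncol)     # row index, counted from the top
  col <- (idx - 1) %% ncol + 1   # column index, counted from the left
  data.frame(
    panel = idx,
    left_column = col == 1,
    bottom_row = row == nrow
  )
}

pos <- panel_positions(n_panels = 9, ncol = 3, nrow = 3)

# Panels that are neither in the leftmost column nor in the bottom row
inner <- pos$panel[!pos$left_column & !pos$bottom_row]
inner
```

Under these assumptions, inner contains the panels 2, 3, 5 and 6, matching the panels whose ticks are removed in Figure 4.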
We can also label each panel by annotating each subject’s identifier
(ex. A1 for Subject 1 in Amblyopia; see Figure 5). There
are two ways of achieving this. The first way is to revisit and modify the
code chunk that generates the nine plots iteratively using
lapply(), but this goes against our aim of linearizing the
process of subplotting. Therefore, we will use the function
sm_panel_label() to label each panel.
# Figure 5 - Add subject identification label (ex. A1) in each panel
indv_plots_label1 <- sm_panel_label(
all_plots = indv_plots, x = 0.15, y = 0.85,
panel_pretag = "A", panel_tag = "1",
text_color = "black"
)
The function sm_panel_label() has a few arguments, some
of which are similar to those of the former function. For the first
argument all_plots, users must provide the list vector
(indv_plots) that stores all plots. Next, x
and y determine the location of the panel label; 0.5 is the
origin of the panel (i.e., the center of each subplot).
panel_tag sets the string for enumeration. In this example,
we set panel_tag = "1" so that the first panel is labelled
“1”, the next one “2”, and so on. There are other options to
enumerate each panel, such as: 1) panel_tag = "A" for
uppercase letters, 2) panel_tag = "a" for lowercase
letters, 3) panel_tag = "I" for uppercase Roman numerals, and
4) panel_tag = "i" for lowercase Roman numerals. These options
of the panel_tag argument were inspired by the
plot_annotation() function of the patchwork
package12. Also, there are tag
labels that stay constant across panels:
panel_pretag and panel_posttag. As their names
imply, panel_pretag comes before panel_tag,
and panel_posttag comes after panel_tag. These
two arguments can be strings of any length. To label each panel
using the subject’s identifier that is consistent with those in the data
frame df2 (ex. A1 and A3),
panel_pretag should be “A”. Then, we store the output from
sm_panel_label() in the object
indv_plots_label1. The differences between
plot_annotation() and sm_panel_label() are
that in sm_panel_label() 1) the locations can be specified
in x and y coordinates within but not outside
each panel, 2) annotations can be added multiple times in sequence (as
demonstrated in this example), and 3) the plot input must be a single list
object rather than separate ggplot2 objects.
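To make the tag options concrete, the sketch below builds the label strings that each option would be expected to produce. make_panel_tags() is a hypothetical helper written for this illustration; it only mimics the labelling scheme described above and is not how sm_panel_label() is implemented.

```r
# Hypothetical helper (not part of smplot2) that builds panel labels
# from a tag style plus an optional prefix and suffix.
make_panel_tags <- function(n, tag = "1", pretag = "", posttag = "") {
  idx <- seq_len(n)
  core <- switch(tag,
    "1" = as.character(idx),
    "A" = LETTERS[idx],
    "a" = letters[idx],
    "I" = as.character(utils::as.roman(idx)),
    "i" = tolower(as.character(utils::as.roman(idx))),
    stop("Unsupported tag style: ", tag)
  )
  paste0(pretag, core, posttag)
}

make_panel_tags(3, tag = "1", pretag = "A")   # "A1" "A2" "A3"
make_panel_tags(3, tag = "a", posttag = ")")  # "a)" "b)" "c)"
make_panel_tags(4, tag = "i")                 # "i" "ii" "iii" "iv"
```

The first two calls correspond to the two rounds of labelling used for Figure 5: subject identifiers (A1, A2, ...) and lettered panel labels (a), b), ...).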
We will also label each panel so that the first panel has “a)” and
the second panel has “b)”. To do so, we use the function
sm_panel_label() again, where we provide the
indv_plots_label1 object as input and set
panel_tag = 'a' and panel_posttag = ')'. This
creates labels with a lowercase letter that is followed by a closing
parenthesis in each panel. The final output is then saved in the
indv_plots_label2 object, which is the end-result of
running sm_panel_label() twice.
Figure 5. A composite plot with two rows and five columns. Nine panels have been allocated to display each subject’s data in the Amblyopia group. The last panel has been left empty. The legend instead has taken over the empty space, showing that each condition has a unique color.
# Figure 5 - Add panel labels in lowercase letters followed by a closing parenthesis
indv_plots_label2 <- sm_panel_label(
all_plots = indv_plots_label1, x = 0.15, y = 0.7,
panel_tag = "a", panel_posttag = ")",
text_color = "black", fontface = "bold"
)
Next, we arrange the nine panels into a layout with five columns and two
rows (ncol = 5, nrow = 2) using the function
sm_put_together(). We also add the common title, common
x-axis label and common y-axis label by directly supplying character
strings rather than using sm_common_*() functions. This
option is less flexible but more convenient; the text size (but not
the location) can still be adjusted using the labelRatio argument,
where 1 refers to the default size. The
labelRatio argument does not affect the size of text labels
created from sm_common_*() functions.
sm_put_together() also supports combining subplots with
secondary x- and y-axes (not shown in the tutorial);
xlabel2 and ylabel2 should be provided to set
the titles for these axes.
# Figure 5 - Combine the subplots into one figure
plots_tgd2 <- sm_put_together(
all_plots = indv_plots_label2,
title = "Individual data (subplotting with one factor)",
xlabel = "Spatial frequency (c/deg)",
ylabel = "Visual deficit", ncol = 5, nrow = 2,
wmargin = -2, hmargin = -2, labelRatio = 0.9
)
Now that a composite figure has been created with individual subplots
and labels, we will add a common legend in the combined figure
plots_tgd2. There are two ways to do so using
smplot2. There is a quick way and a slow but highly
customizable way. They both involve the function
sm_add_legend(). To preview, readers can compare the legend
in Figure 5 from the quick method with the legend in Figure 6 from the
slow method.
The first method of adding a legend makes
sm_add_legend() derive a legend from a reference plot so
that users do not have to build it manually. To make the legend using the
quick method, users should provide inputs for a few arguments. The
output from sm_put_together() (plots_tgd2)
must be supplied as input for the argument combined_plot.
The coordinate of the legend can be specified using x
(horizontal coordinate of the legend) and y (vertical
coordinate of the legend) arguments. Also, a reference plot from which
the legend can be derived must be supplied for the argument
sampleplot (i.e., one plot from indv_plots).
In this example, the coordinate is set to be within the area of the
empty 10th panel (x=0.92, y=0.35); the sample
plot is derived from the first subject’s plot
(indv_plots[[1]]). The direction argument
(i.e., orientation) of the legend is specified to be
vertical, not horizontal.
legend_spacing is an argument that can set the extent of
blank space within the legend to prevent overcrowding. If
border = FALSE, then the border of the legend will be
removed. The font size of the legend can be adjusted using the argument
font_size. The code below stores the output from
sm_add_legend() in the object
plots_tgd2_legend, and then saves the figure as a PNG image
using the ggsave() function with specified
width and height.
# Figure 5 - Legend in the area of the 10th panel
plots_tgd2_legend <- sm_add_legend(
combined_plot = plots_tgd2, x = 0.92, y = 0.35,
sampleplot = indv_plots[[1]], direction = "vertical",
legend_spacing = 1, border = TRUE, font_size = 13
)
# Figure 5 - Save the composite figure as a PNG image
ggsave("together2.png", plots_tgd2_legend,
width = 15, # inches
height = 6.6
)
We can make two observations from the legend in Figure 5. First, the
legend’s title matches the name of one of the columns
(Condition) in the data frame df2. Second,
the labels within the legend are identical to the character strings that are
provided in the Condition column of df2. So,
these similarities indicate that the legend’s title and labels have been
automatically generated from the given data frame. If the legend
is created with this quick approach (by making
sm_add_legend() derive one from a sample plot), the
title and the labels cannot be customized, although the title can be
removed.
# Compute the average and standard error for each SF and Condition level
df2_amb_avg <- df2_amb %>%
group_by(logSF, Condition) %>%
summarise(
avgBP = mean(absBP),
stdErr = sm_stdErr(absBP), .groups = "drop"
)
head(df2_amb_avg)
## # A tibble: 6 × 4
##   logSF Condition  avgBP stdErr
##   <dbl> <fct>      <dbl>  <dbl>
## 1    -1 One       0.0769 0.0182
## 2    -1 Two       0.283  0.0649
## 3    -1 Three     0.199  0.0720
## 4     0 One       0.234  0.151
## 5     0 Two       0.491  0.0705
## 6     0 Three     0.374  0.135
Since we have one empty panel that is available for plotting (10th
panel of Figure 5), we can add an additional panel, which shows the
average data of the nine subjects with error bars (ex. standard error).
This panel showing the average data should have the same x- and y-limits
as those of the individual subjects’ panels. To build it, we first compute the
average and standard error of the data from the nine individuals at each
level of the independent variable (logSF) and experimental condition
(Condition), and store the resulting data frame in the
object df2_amb_avg. This step can be achieved using
functions from the dplyr package, such as
group_by() and summarise().
group_by() does not change the data frame at the surface
level. Instead, it changes its underlying structure so that the
computations called later within
summarise() are performed separately for each combination of the
grouping variables’ levels. The two computations - mean and standard error - are
conducted using the functions mean() and
sm_stdErr(), respectively. The latter is a shortcut
function from the smplot2 package. When there are several grouping
variables, summarise() by default drops only the last one, so part of the
grouping remains after the computation has been performed; it is therefore
important to remove the grouping entirely by setting .groups = 'drop' in
summarise(). More information about these functions can be
found in Chapter 7 of the documentation webpage (https://smin95.github.io/dataviz).
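The effect of .groups = 'drop' can be checked on a toy data frame. The sketch below uses made-up values, and std_err() is a hand-written standard error of the mean that is only assumed to match what sm_stdErr() returns; the object names are illustrative.

```r
library(dplyr)

# Toy data (made-up values): two spatial frequencies x two conditions x three subjects
toy <- data.frame(
  logSF = rep(c(-1, 0), each = 6),
  Condition = rep(rep(c("One", "Two"), each = 3), times = 2),
  absBP = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2)
)

# Standard error of the mean written out by hand; assumed here to match
# what sm_stdErr() returns
std_err <- function(x) sd(x) / sqrt(length(x))

# Default behaviour: only the last grouping variable (Condition) is dropped
avg_default <- toy %>%
  group_by(logSF, Condition) %>%
  summarise(avgBP = mean(absBP), stdErr = std_err(absBP))

# .groups = "drop" returns an ungrouped data frame
avg_dropped <- toy %>%
  group_by(logSF, Condition) %>%
  summarise(avgBP = mean(absBP), stdErr = std_err(absBP), .groups = "drop")

group_vars(avg_default)  # "logSF"
group_vars(avg_dropped)  # character(0)
```

Both results contain the same four rows; they differ only in whether a grouping is still attached, which matters for any dplyr verbs that are applied afterwards.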
# Figure 6 - 10th panel showing the average data
avg_plot <- ggplot(data = df2_amb_avg, aes(
x = logSF, y = avgBP, group = Condition,
shape = Condition, fill = Condition, color = Condition
)) +
geom_line(linewidth = 1) +
geom_point(size = 5, color = "white", stroke = 1) +
geom_linerange(aes(ymin = avgBP - stdErr, ymax = avgBP + stdErr), linewidth = 1) +
scale_color_manual(values = sm_palette(3)) +
scale_fill_manual(values = sm_palette(3)) +
scale_shape_manual(values = c(21, 22, 23)) +
sm_hgrid() +
scale_y_continuous(limits = c(0, 3)) +
scale_x_continuous(
limits = c(-1.3, 3.3),
labels = c(0.5, 1, 2, 4, 8)
) +
annotate("text", label = "Average", x = -0.3, y = 2.65, size = 5.5,
fontface='bold')
Figure 6. A composite plot with two rows and five columns with a common legend that is located at the bottom-right area of the figure. The first nine panels have been assigned to display each individual’s data from the Amblyopia group, whereas the last panel has been assigned to show the average data with error bars, which represent standard errors.
With the newly created data frame df2_amb_avg, we can
plot the average data using the same mapping specifications as those in
the individual plots in the lapply() function. Average data
are plotted as points with white borders using the
geom_point() function. The lines are drawn to join the
points with geom_line(), and the error bars without caps
are displayed using geom_linerange(), which is a useful
function for indicating intervals of some range. The aesthetic mapping
is defined in geom_linerange() so that vertical lines with
certain ranges can be plotted at each level of logSF
(x-axis); we explicitly specify the minimum
(ymin = avgBP - stdErr) and maximum
(ymax = avgBP + stdErr) of the vertical range to be equal
to the range of the standard error of the average data. We do not use
the lapply() function here because we only need to make one
plot.
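Because df2_amb_avg depends on the tutorial's data, the following self-contained sketch reproduces the same layering on a toy summary table with made-up values. It only uses ggplot2, and layer_data() is called to inspect the interval that geom_linerange() will draw.

```r
library(ggplot2)

# Toy summary data (made-up values): one mean and one standard error per x level
toy_avg <- data.frame(
  logSF = c(-1, 0, 1, 2, 3),
  avgBP = c(0.10, 0.25, 0.40, 0.80, 1.20),
  stdErr = c(0.02, 0.05, 0.04, 0.10, 0.15)
)

p <- ggplot(toy_avg, aes(x = logSF, y = avgBP)) +
  geom_line(linewidth = 1) +
  # Capless error bars spanning mean - SE to mean + SE
  geom_linerange(aes(ymin = avgBP - stdErr, ymax = avgBP + stdErr), linewidth = 1) +
  geom_point(size = 4)

# Inspect the values that the error-bar layer (second layer) will draw
bars <- layer_data(p, 2)
bars[, c("x", "ymin", "ymax")]
```

Each row of bars spans two standard errors in total, one below and one above the mean.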
# Figure 6 - Combine all the subplots into a composite plot
all_plots <- list(indv_plots_label1, avg_plot)
plots_tgd3 <- sm_put_together(
all_plots = all_plots,
title = "Individual data and average (subplotting with one factor)",
xlabel = "Spatial frequency (c/deg)",
ylabel = "Visual deficit", ncol = 5, nrow = 2,
wmargin = -4.5, hmargin = -4.5, labelRatio = 0.9
)
The limits of both x- and y-axes, as well as the thematic background
(i.e., sm_hgrid()), are set to be identical to those of the
individual plots. Also, we annotate the average plot with the bolded
text 'Average' using annotate(), where we can
specify its coordinate to be at the top-left of the panel
(x = -0.3, y = 2.65) in the units of the plotted data
(x = logSF, y = avgBP). The plot output is
then saved in the object avg_plot.
Then, we store all ten plots (9 individuals’ plots in
indv_plots_label1 + one average plot in
avg_plot) that we have generated into one list
using the function list() and then assign the output to the
object all_plots. The all_plots object will be
the input for sm_put_together(), which will create a
composite plot using the plots, title and axis labels with a layout
(ncol = 5 and nrow = 2).
Since we have ten panels to plot in a layout with five columns and
two rows, only a limited amount of space remains available
for the legend (see Figure 6 for our final output). So, to effectively
use the remaining plotting space, we will have to build and customize a
legend using the function sm_common_legend() rather than
relying on the automatically generated legend from
sm_add_legend(). After creating a legend manually, we can
then add it to the composite plot using sm_add_legend() at
a specific location within the combined figure. This option requires
more work but it is more flexible.
To do so, we essentially need to create a new plot using the standard
procedure of ggplot2 (see the code below). This includes
mapping the x and y variables to certain
aesthetics. Points are also drawn using geom_point() so
that they are included in the legend. The legend labels have also been
changed, as specified in the two scale_*() functions.
Finally, we finish creating the legend by using
sm_common_legend(), which essentially hides all features of
a normal graph, such as points and axis lines that will be plotted
otherwise. As a result, the output legend2 only prints the
legend components when it gets called. We set the legend to have a
horizontal orientation with no borders
(border = FALSE). The text size of the legend can also be
adjusted using the argument textRatio, which has been set
to 1.1 in this example; this means that the text size of the legend is
1.1x larger than the default from a given theme. Lastly,
legend_spacing controls the amount of blank space in the
legend.
# Figure 6 - Create a legend manually
legend2 <- ggplot(data = df2_amb, aes(
x = logSF, y = absBP, group = Condition,
shape = Condition, fill = Condition
)) +
geom_point(size = 4.5, color = "white") +
scale_fill_manual(
values = sm_palette(3),
labels = c(
"Condition 1 ", "Condition 2 ",
"Condition 3 "
)
) +
scale_shape_manual(
values = c(21, 22, 23),
labels = c(
"Condition 1 ", "Condition 2 ",
"Condition 3 "
)
) +
sm_common_legend(
title = FALSE, direction = "horizontal", border = FALSE,
textRatio = 1.1, legend_spacing = .9
)
The customized legend can be added to the composite plot with the
function sm_add_legend() at a specific coordinate
(x = 0.84, y = 0.05; bottom-right region of Figure 6).
Since we have manually created the legend with
sm_common_legend(), there is no need for us to supply
inputs for other arguments in sm_add_legend(), such as
direction, border and sampleplot,
all of which will be ignored. The final output - a composite figure that
shows both individual plots and a panel that shows the average data
(Figure 6) - is saved using ggsave() from the
ggplot2 package.
# Figure 6 - Add the legend and save the figure as a PNG image
plots_tgd3_legend <- sm_add_legend(
combined_plot = plots_tgd3, legend = legend2, x = 0.84,
y = 0.05
)
ggsave("together3.png", plots_tgd3_legend,
width = 15, # inches
height = 6.6
)
Readers might realize that they could also generate Figures 4-6 with
facet_wrap(). Indeed, when subplotting data using one
variable, using facet_wrap() might be simpler. However, the
advantage of using smplot2’s pipeline with
lapply() is that it remains very similar even if more
variables or lapply() functions are added (next two
examples).
Example 2: Subplotting Data Using Two Variables
Thus far, we have only explored a relatively simple way of assigning
data to each panel. In this example, we will allocate data to each panel
using two factors (Condition and Subject
Group).
In this example, the same dataset (df2) will be used
albeit with some data transformations. Average data at each level of
condition and subject group will be plotted. There are three
experimental conditions and two groups, totaling six combinations of
levels from the two variables. Therefore, the data will be allocated to
six separate panels.
df2_avg <- df2 %>%
mutate(logSF = log2(SF)) %>%
mutate(Condition = factor(Condition, levels = c("One", "Two", "Three"))) %>%
group_by(logSF, Condition, Group) %>%
summarise(
avgBP = mean(absBP),
stdErr = sm_stdErr(absBP), .groups = "drop"
)
head(df2_avg)
## # A tibble: 6 × 5
##   logSF Condition Group      avgBP stdErr
##   <dbl> <fct>     <chr>      <dbl>  <dbl>
## 1    -1 One       Amblyopia 0.0769 0.0182
## 2    -1 One       Normal    0.149  0.0491
## 3    -1 Two       Amblyopia 0.283  0.0649
## 4    -1 Two       Normal    0.287  0.0707
## 5    -1 Three     Amblyopia 0.199  0.0720
## 6    -1 Three     Normal    0.244  0.0868
To begin with, the original data frame df2 is
transformed as in the previous example: a new
column stores spatial frequency on a log scale to achieve equal spacing
(logSF column), and the levels of the
Condition column are re-ordered into their proper, numerical order by
converting the column from strings into a factor
('One'-'Two'-'Three'). Next, the
code computes the average and standard error for each combination of the
two variables. This is possible because the underlying structure of the
data frame is transformed using group_by() so that
subsequent computations for average and standard error on these data in
summarise() are performed according to the specified
groupings. As in Example 1, the mean is computed using
mean() and the standard error is computed using
sm_stdErr().
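The reason for converting Condition into a factor with explicit levels can be seen in a short base R sketch. The vector cond below is a made-up stand-in for the Condition column.

```r
# Condition labels stored as plain strings
cond <- c("One", "Two", "Three", "One", "Two", "Three")

# Default factor levels are sorted alphabetically
levels(factor(cond))
# "One" "Three" "Two"

# Supplying levels explicitly restores the intended numerical order
cond_ordered <- factor(cond, levels = c("One", "Two", "Three"))
levels(cond_ordered)
# "One" "Two" "Three"
```

Without the explicit levels, plots and summaries would list Condition "Three" before "Two".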
In this example, there will be two levels of lapply()
structure in the code fragment because we will perform subplotting with
two variables (Group and Condition). Hence,
the code structure will have one inner function and one outer function.
This is better known as a nested structure, which involves
using functions in a hierarchical fashion. The outer function will
iterate over the variable Group, and the inner function
over Condition. This structure of the nested functions
will affect the order in which the plots will be generated and stored in
the object avg_plots. Specifically, plots from the first
level of Group and all three levels of
Condition will be generated first, followed by those from
the second level of Group.
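The ordering produced by a nested lapply() structure can be previewed without any plotting. In the sketch below, each plot is replaced by a text label so that the order of the outputs is easy to read; the object labels is illustrative only.

```r
group_list <- c("Amblyopia", "Normal")
cond_list <- c("One", "Two", "Three")

# Each "plot" is replaced by a text label so that the order is easy to read
labels <- lapply(seq_along(group_list), function(iGroup) {
  lapply(seq_along(cond_list), function(iCond) {
    paste(group_list[iGroup], cond_list[iCond], sep = " / ")
  })
})

length(labels)    # 2: one element per group
lengths(labels)   # 3 3: three conditions within each group
labels[[1]][[2]]  # "Amblyopia / Two"
labels[[2]][[1]]  # "Normal / One"
```

The outer index selects the group and the inner index selects the condition, which is the same indexing used for the plots in this example.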
With the structure of the nested lapply() functions in
mind, we can first create vectors that contain string elements that
match the identifiers of Group and Condition
columns from the df2_avg data frame. These are then stored
in group_list and cond_list objects,
respectively. Each iteration of the nested functions will filter the
average data based on the selected element of group_list
and cond_list from their indices, ex.
Group == group_list[[iGroup]], where iGroup =
1, and therefore, Amblyopia.
# Figure 7 - Visualize each subplot
group_list <- c("Amblyopia", "Normal")
cond_list <- c("One", "Two", "Three")
shape_list <- c(21, 22, 23) # Shape for each condition
cList <- list(
c("#ddc7d8", "#d3a7c0", "#b7729a"), # Color for each subject group
c("#bababa", "#999999", "#636262")
)
avg_plots <- lapply(seq_along(group_list), function(iGroup) {
lapply(seq_along(cond_list), function(iCond) {
# First part: Filter average data for each group & condition during each iteration
currData <- df2_avg %>%
filter(Condition == cond_list[iCond]) %>%
filter(Group == group_list[iGroup])
# Second part: Plot the filtered average data
pp <- ggplot(data = currData, aes(x = logSF, y = avgBP)) +
geom_area(fill = cList[[iGroup]][[iCond]], alpha = 0.3) +
geom_line(linewidth = 1, color = cList[[iGroup]][[iCond]]) +
geom_point(
size = 5, shape = shape_list[[iCond]], color = "white",
fill = cList[[iGroup]][[iCond]], stroke = 1
) +
geom_linerange(aes(ymin = avgBP - stdErr, ymax = avgBP + stdErr),
linewidth = 1, color = cList[[iGroup]][[iCond]]
) +
scale_y_continuous(limits = c(0, 1.6)) +
scale_x_continuous(
limits = c(-1.3, 3.3),
labels = c(0.5, 1, 2, 4, 8)
) # pp is the intermediate plot output
# Third part (optional): Apply different themes based on subject grouping
if (group_list[iGroup] == "Amblyopia") {
pp + sm_minimal() # No grids for Amblyopia
} else {
pp + sm_hgrid() # Horizontal grids for Normal
}
})
})
In addition, shapes are set to be unique for each of the three
experimental conditions; their values are stored in the
shape_list vector, and each value will get selected during
the iteration for each condition to specify the shape when plotting (ex.
shape = shape_list[[iCond]]). The color palettes for the
two subject groups are set to be different, and the intensity of the
color is set to increase as a function of Condition. The
color values (in hex codes) are stored in the list vector
cList, which contains six different colors that have been
separated into two vectors (one for each Group). Therefore,
if the iteration has an index for the first level of Group
and the second level of Condition, the corresponding color
will be cList[[1]][[2]], where cList[[1]]
contains three colors from a pink palette in order of increasing
intensity. Here, the first level of Group is
Amblyopia because the first element of
group_list is Amblyopia. Lastly, using
an if conditional statement, we only allow subplots of
the Normal group’s data to have horizontal grids but not those
of Amblyopia. This is possible because the intermediate
plotting output is stored as pp in the second part of the
lapply() function. Thematic functions are then added
modularly to pp in the third part of the
lapply() function, creating the final output. The third part is
optional for subplotting, but it can be useful for setting specific
customizations. Integrating the programmatic approach for plotting
allows us to dynamically control aesthetics, such as color, shape and
theme (Figure 7), which is very difficult to do in ggplot2
unless users code plots separately.
As in Example 1, the lapply() function must contain two
parts. The first part filters for the data of interest, which are
average data at each group and condition. The filtered data is stored in
the object currData. Then, the second part of the function
plots the data from currData. Specifically, the average
data are plotted as points using geom_point(), whereas the
associated standard error values are drawn in vertical lines using
geom_linerange() (as explained in Example 1). Here, we use
an additional function from the ggplot2 package:
geom_area(), which plots area (i.e., filled line plots).
The function essentially fills the area below the lines of the plot with
colors. Coloring the area is useful to illustrate the magnitude of the
data. In this example, we make it transparent to some extent by setting
alpha = 0.3; if alpha = 1, the colored area
will be opaque. Furthermore, the x- and y-axes limits are set to be
identical for all panels using scale_x_continuous() and
scale_y_continuous() functions because the panels will get
combined into one composite figure with shared tick labels. Finally, a
theme (sm_minimal() or sm_hgrid(), depending on the subject group) is
applied to optimize the aesthetics of each panel for subplotting.
Figure 7. A composite plot with two rows and three columns, showing the average data from each condition and group with error bars (standard error). The first row shows data of the Amblyopia group, whereas the second row shows the data of the Normal group, as specified in the lapply() function. The main and secondary titles have been added as annotations.
Notice that in this example, the object avg_plots is a
list of lists, where each element is a list containing three
plots. So, it has a length of 2 even though it stores six plots in total.
However, the function sm_put_together() still recognizes it
as a list with six elements (i.e., plots) because the function
automatically flattens each element if the element is a list
itself. So, there is no need for us to manually reorganize the structure
of the object avg_plots for the function
sm_put_together() to operate. It is important to be aware
that the order of the plots that will be used in the composite figure
from sm_put_together() is: avg_plots[[1]][[1]],
avg_plots[[1]][[2]], avg_plots[[1]][[3]],
avg_plots[[2]][[1]], avg_plots[[2]][[2]], and
avg_plots[[2]][[3]]. So, if we are to subplot these panels
in a 2x3 figure, then three plots from avg_plots[[1]] will
be on the first row, whereas three plots from
avg_plots[[2]] will be on the second row.
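The flattening and the resulting row-by-row placement can be mimicked in base R. In the sketch below, nested is a hypothetical stand-in for avg_plots that holds labels instead of plots, and unlist(recursive = FALSE) is used only to imitate the behavior described above; it is not claimed to be what sm_put_together() does internally.

```r
# Stand-in for avg_plots: a list of two lists, each holding three labels
nested <- list(
  list("G1-C1", "G1-C2", "G1-C3"),
  list("G2-C1", "G2-C2", "G2-C3")
)
length(nested)  # 2

# Remove one level of nesting; this mimics, but is not, smplot2's internal flattening
flat <- unlist(nested, recursive = FALSE)
length(flat)    # 6

# Lay the six labels out row by row in a grid with two rows and three columns
layout <- matrix(unlist(flat), nrow = 2, ncol = 3, byrow = TRUE)
layout
```

The first row of layout holds the three entries of nested[[1]] and the second row holds those of nested[[2]], mirroring the arrangement of the two groups in Figure 7.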
Next, we set y-axis label of the combined figure as in Example 1 by
directly providing character strings in sm_put_together().
However, for the xlabel, we use the output created from
sm_common_xlabel(), demonstrating that both options of
labelling the axes can work in concert. Here, the argument
wRatio controls the width of the leftmost column relative to those
of the other columns. The value exceeds 1 because the panels in
the leftmost column have y-axis ticks, which take up additional plotting
space. If an input for this argument is missing, the function by default
adjusts the width ratio using information about the composite plot,
such as the number of lines and characters in the tick labels. The
argument ylabel2 has an input of an empty string because,
if the input is supplied in any form (even when it is empty), some space
will be spared on the right side of the composite plot (area for labels
of secondary y-axis), where we will add labels of the two subject
groups. As previously noted, labelRatio only affects axis
labels that are created directly from sm_put_together(), so
it will adjust the size of ylabel but not
xlabel.
# Figure 7 - Combine the subplots and specify the layout
xlabel <- sm_common_xlabel("Spatial frequency (c/deg)", x = 0.52)
avg_plots_tgd <- sm_put_together(
all_plots = avg_plots,
title = "", # Spare space for title
xlabel = xlabel,
ylabel = "Visual deficit",
ylabel2 = "", # Spare space for group label
ncol = 3, nrow = 2, wRatio = 1.1, wmargin = -2, hmargin = -2,
labelRatio = 0.95 # Text size of the ylabel
)
In this example, notice that we also did not supply any title text
for the main title (title argument) of the combined
plot in sm_put_together(). By passing an empty string, we
merely allocated some space for the title at the top of the figure,
where we will add text annotations using sm_add_text(). We
can set the coordinate of the title to be near the center of the x-axis
and at the top along the y-axis (x = 0.53, y = 0.98, where 0.5
represents the center of the combined figure) and its
fontface to be bold. The text annotation
itself can be defined using the label argument within
sm_add_text().
As for the group labels, we can also use sm_add_text()
to denote the two subject groups by setting the orientations of the
texts at 270 degrees relative to the horizontal axis using the
angle argument. We position them on the right side of the
composite figure by setting x = 0.93.
Next, because we have assigned data to multiple panels using two
factors (groups and conditions), it leaves us with one more factor
(Condition) to label in the composite plot. Here, we can
add a sub-title at the top of each column where we label each condition
(as shown in Figure 7) using sm_add_text(). When using
sm_add_*() functions, the coordinate is uniform regardless
of the size of the composite plot output that is generated from
sm_put_together() or sm_add_legend() (0 to 1;
x = 0.5, y = 0.5 is the center); essentially, the
annotations can be added to the composite figure similarly as to how
geom objects can be added together to form a
ggplot2 object with a common coordinate.
# Figure 7 - Add text annotations
avg_plots_tgd1 <-
avg_plots_tgd + # Composite plot
sm_add_text(
label = "Average data (subplotting with two factors)", # Main title
x = 0.53, y = 0.98, fontface = "bold", size = 17
) +
sm_add_text(label = "Condition 1", x = 0.25, y = 0.92, size = 14) + # Sub-title for Column 1
sm_add_text(label = "Condition 2", x = 0.51, y = 0.92, size = 14) + # Sub-title for Column 2
sm_add_text(label = "Condition 3", x = .78, y = .92, size = 14) + # Sub-title for Column 3
sm_add_text(label = "Controls", x = 0.93, y = 0.335, angle = 270, size = 15) + # Group label
sm_add_text(label = "Amblyopia", x = 0.93, y = 0.705, angle = 270, size = 15) # Group label
The final figure is stored in the object avg_plots_tgd1,
which then gets saved as an image using the ggsave()
function from the ggplot2 package.
Example 3: Complex Subplotting Using Separate lapply() Functions
In this example, using the data frames df2_amb and
df2_avg from the previous examples, we will create a
composite figure that plots the data of individuals in the
Amblyopia group in a slightly more complex way that is not
currently possible with ggplot2 or its third-party packages
that enhance its faceting functions. This time, we will allocate the
data from each condition of each individual to a unique panel, as well
as plot the average data for each condition on a unique panel.
# Figure 8 - Generate three subplots for each subject
subj_list <- paste0("A", 1:9) # 9 subjects
cond_list <- c("One", "Two", "Three")
shape_list <- c(21, 22, 23)
cond_cList <- c("#ddc7d8", "#d3a7c0", "#b7729a")
indv_plots <- lapply(seq_along(subj_list), function(iSubj) {
lapply(seq_along(cond_list), function(iCond) {
# First part: Filter data for each subject & condition during each iteration
subj_data <- df2_amb %>%
filter(Subject == subj_list[iSubj]) %>%
filter(Condition == cond_list[iCond])
# Second part: Plot the filtered data
ggplot(data = subj_data, aes(x = logSF, y = absBP)) +
geom_area(fill = cond_cList[[iCond]], alpha = 0.3) +
geom_line(linewidth = 1, color = cond_cList[[iCond]]) +
geom_point(
size = 5, shape = shape_list[[iCond]],
color = "transparent", fill = cond_cList[[iCond]]
) +
sm_hgrid() +
scale_y_continuous(limits = c(0, 3)) +
scale_x_continuous(
limits = c(-1.3, 3.3),
labels = c(0.5, 1, 2, 4, 8)
) +
annotate("text",
label = paste0("A", iSubj), x = -0.9, y = 2.65, size = 5.5,
hjust = 0
)
})
})As the title for this example implies, we will create two separate
lapply() functions to build a composite figure. With the
data frame df2_amb, the first function will iterate through
each subject’s data at each condition, creating 27 plots (9 subjects × 3
conditions). With the data frame df2_avg, another
lapply() function will iterate through the average data at
each condition, generating 3 plots (3 conditions). These outputs will
then be combined and stored in a single object, which will then be used
as input by the layout function sm_put_together() to create
a composite plot of 30 subplots, all of which will have the same x- and
y-axis limits.
To generate a plot for each subject at each condition using
df2_amb, a nested lapply() structure should be
used, with one inner function and one outer function (as described in
Example 2). During each iteration, data can be filtered in the same way
as in Example 2. The structure of the nested functions will determine the
order in which the figure outputs will be created and stored in the
output (i.e., indv_plots). In this example, the first three
outputs will be plots using data from the first element of
subj_list and all three elements from
cond_list because the latter has been used to iterate
through the inner lapply() function. Aesthetics can also be
dynamically controlled using the programmatic approach. For example,
different shapes and colors can be set to represent each condition, as
specified by the order of objects shape_list and
cond_cList. The figures that are generated from this
lapply() function are stored in the indv_plots
object.
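The ordering described above can be checked without drawing anything. The following is a minimal sketch that swaps the ggplot() call for a text label, so that the order of the nested output can be inspected directly. The object order_check and the label format are stand-ins used only for illustration; subj_list and cond_list are the same vectors as in the code above.

```r
# Same vectors as in the tutorial
subj_list <- paste0("A", 1:9)
cond_list <- c("One", "Two", "Three")

# Stand-in for the plotting code: each iteration returns a label
# instead of a ggplot object (an assumption made for illustration)
order_check <- lapply(seq_along(subj_list), function(iSubj) {
  lapply(seq_along(cond_list), function(iCond) {
    paste(subj_list[iSubj], cond_list[iCond], sep = "-")
  })
})

length(order_check)      # number of outer elements (subjects): 9
lengths(order_check)     # inner elements per subject (conditions): 3 each
unlist(order_check)[1:3] # "A1-One" "A1-Two" "A1-Three"
```

Because the conditions are iterated in the inner function, the first three outputs all belong to the first subject. Swapping the two lapply() calls would instead group the outputs by condition.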
In Example 1, we annotated each panel with the subject’s identifier
using the function sm_panel_label() because we had
omitted code that adds a panel label inside the nested
lapply() code fragment (Figures 5 & 6). Here, we use an
alternative method: the annotation label on each panel is defined within
the lapply() code fragment that generates the individual
panels in sequence. Specifically, this can be done using the
function annotate() from the ggplot2
package. We specify its annotation type as text for the
first input, and set the label to each subject’s identifier by
concatenating the string A with the index of the subject
during each iteration (iSubj) using the function
paste0(). The x and y coordinates
are set in the units of the plotted data so that
the label annotations are at the top-left of each panel. The argument
hjust aligns the text to the left because we have set it to
0. If hjust is set to 1, the text label will be aligned to
the right. After writing the code, readers can check whether the
indv_plots object correctly stores each subject’s plot at
each condition with the panel label of each subject’s identifier (e.g., A1
in the first plot of indv_plots).
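To illustrate the effect of hjust in isolation, here is a minimal sketch using ggplot2 alone. The data frame toy, its values, and the helper make_panel() are hypothetical and exist only to give the panel an x- and y-range; the annotate() call mirrors the one in the code above.

```r
library(ggplot2)

# Toy data (hypothetical values, only to give the panel an x- and y-range)
toy <- data.frame(
  logSF = c(-1, 0, 1, 2, 3),
  absBP = c(0.4, 1.1, 1.8, 2.2, 1.5)
)

# Hypothetical helper: one panel with a subject label at a fixed position
make_panel <- function(iSubj, hjust) {
  ggplot(toy, aes(x = logSF, y = absBP)) +
    geom_point() +
    # x and y are in data units; with hjust = 0 the text starts at x,
    # with hjust = 1 the text ends at x
    annotate("text",
      label = paste0("A", iSubj), x = -0.9, y = 2.65, size = 5.5,
      hjust = hjust
    )
}

left_aligned <- make_panel(1, hjust = 0)
right_aligned <- make_panel(1, hjust = 1)

# The built data of the second layer records the label and its justification
layer_data(left_aligned, 2)[, c("x", "y", "label", "hjust")]
```

Printing left_aligned and right_aligned side by side shows the same label anchored at x = -0.9, extending to the right in the first case and to the left in the second.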
# Figure 8 - Generate a subplot for each condition's average across subjects
avg_plots_amblyopia <- lapply(1:length(cond_list), function(iCond) {
  # First part: Filter average data for each condition
  currData <- df2_avg %>%
    filter(Group == "Amblyopia") %>%
    filter(Condition == cond_list[iCond])
  # Second part: plot the filtered data
  ggplot(data = currData, aes(x = logSF, y = avgBP)) +
    geom_area(fill = cond_cList[[iCond]], alpha = 0.25) +
    geom_line(linewidth = 1, color = cond_cList[[iCond]]) +
    geom_point(
      size = 5, shape = shape_list[[iCond]], color = "white",
      fill = cond_cList[[iCond]], stroke = 1
    ) +
    geom_linerange(aes(ymin = avgBP - stdErr, ymax = avgBP + stdErr),
      linewidth = 1, color = cond_cList[[iCond]]
    ) +
    sm_hgrid() +
    scale_y_continuous(limits = c(0, 3)) +
    scale_x_continuous(
      limits = c(-1.3, 3.3),
      labels = c(0.5, 1, 2, 4, 8)
    ) +
    annotate("text",
      label = "Average", x = -0.9, y = 2.65, size = 5.5,
      hjust = 0, fontface = "bold"
    )
})
Next, we write code that generates plots using the average data
(from df2_avg) at each condition. This requires a single
lapply() structure, looping through each condition. Data
can be filtered in the same way as in the previous examples. For the average
panels, we set the aesthetics so that the points have white border
lines, thereby accentuating the error bars. We also add annotations
with the bolded text 'Average'. In total, this lapply() function
runs three iterations. The order of
the figure outputs will follow the order of the elements in
cond_list. The three output figures are stored in the
object avg_plots_amblyopia.
As in Example 2, within the lapply() function the
average data are plotted as points using geom_point(), the
lines that connect the points are drawn using geom_line(),
the areas below the lines are filled with colors and some transparency
using geom_area(), and the range of standard error across
subjects is displayed as vertical lines using
geom_linerange(). The ticks and limits of both the x- and
y-axes are set to be consistent across panels.
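The vertical lines drawn by geom_linerange() span avgBP - stdErr to avgBP + stdErr. As a reminder of what these two columns represent, here is a minimal base-R sketch that computes them from made-up numbers. It assumes that stdErr denotes the standard error of the mean (standard deviation divided by the square root of the number of subjects). The data frame toy, its values, and the helper std_err() are hypothetical; the tutorial's df2_avg was prepared in the earlier examples.

```r
# Hypothetical data: four subjects per condition at one spatial frequency
toy <- data.frame(
  Condition = rep(c("One", "Two"), each = 4),
  absBP = c(1, 2, 3, 4, 2, 2, 4, 4)
)

# Standard error of the mean: sd / sqrt(n) (assumed definition of stdErr)
std_err <- function(x) sd(x) / sqrt(length(x))

# Average and standard error for each condition
avg <- aggregate(absBP ~ Condition, data = toy, FUN = mean)
names(avg)[2] <- "avgBP"
avg$stdErr <- aggregate(absBP ~ Condition, data = toy, FUN = std_err)$absBP

# Lower and upper ends of the vertical line drawn by geom_linerange()
avg$ymin <- avg$avgBP - avg$stdErr
avg$ymax <- avg$avgBP + avg$stdErr
avg
```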
# Figure 8 - Put together all subplots from individual and average data
all_plots1 <- list(indv_plots, avg_plots_amblyopia) # Combine all plot outputs in a list
composite_plot <- sm_put_together(
  all_plots = all_plots1,
  title = "Individual and average data (two separate functions)",
  xlabel = "Spatial frequency (c/deg)",
  ylabel = "Visual deficit",
  ncol = 6, nrow = 5, wmargin = -5, hmargin = -5,
  labelRatio = 0.95 # Text size of the axes' label
)
We then combine two objects (indv_plots and
avg_plots_amblyopia) from the two lapply()
structures into a single object (all_plots1) using the
function list(). The object all_plots1 will
then be used as input for sm_put_together(). Notice that since
the indv_plots list has been generated from a nested
lapply() structure, each of the nine elements in the list
contains three plots (hence, 27 plots in total). Conversely,
avg_plots_amblyopia is from a single lapply()
function, so there are three elements in the list, and each element
stores one plot (hence, three plots in total). In other words, these two
lists have different underlying structures. However, this is not an
issue because sm_put_together() will automatically
flatten lists of different structures (e.g., a list of lists) into a
uniform list structure, making the function easier to use
when multiple, separate lapply()
structures have been used to generate numerous subplots.
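To make the difference between the two structures concrete, the sketch below builds stand-in lists of the same shape (text labels instead of plots) and flattens them with a small recursive helper. The helper flatten_plots() and the objects indv_stub, avg_stub and all_stub are hypothetical and written only for illustration; this is not the internal code of sm_put_together().

```r
# Stand-ins with the same shapes as indv_plots (9 x 3) and
# avg_plots_amblyopia (3); labels replace the ggplot objects
indv_stub <- lapply(1:9, function(iSubj) {
  lapply(1:3, function(iCond) paste0("A", iSubj, "_cond", iCond))
})
avg_stub <- lapply(1:3, function(iCond) paste0("avg_cond", iCond))

all_stub <- list(indv_stub, avg_stub)

# Hypothetical helper: recursively unpack lists until only leaves remain.
# Real ggplot objects would need a class check (e.g., inherits(x, "ggplot"))
# in place of is.character(), because they can themselves be lists.
flatten_plots <- function(x) {
  if (is.character(x)) {
    return(list(x))
  }
  do.call(c, lapply(x, flatten_plots))
}

flat <- flatten_plots(all_stub)
length(flat) # 27 individual panels + 3 average panels = 30
flat[[28]]   # first average panel: "avg_cond1"
```

The flattened list keeps the creation order, so the three average panels come last, which matches the final row of the composite figure.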
Figure 8. A composite plot with five rows and six columns. Twenty-seven panels are allocated for displaying each subject’s data from each condition. The last three panels are assigned to show the average data with error bars, which denote standard errors. A borderless common legend has been created and placed at the bottom-right corner of the composite figure.
Thirty panels that have been generated with the two separate
lapply() iterations are combined using
sm_put_together(), where ncol = 6, nrow = 5
are set as the layout of the composite figure. The main title,
x-axis label and y-axis label are also defined and integrated into
the composite figure. Next, we create a custom legend using
sm_common_legend() as we have done in the previous
examples. This legend is set to have a horizontal
orientation with no border (border = FALSE). The legend is
coded so that the fill and shape aesthetics
are mapped to each level of Condition. Then, we can add it
to the object composite_plot using the function
sm_add_legend() at the bottom-right corner of the composite
plot (x = 0.85, y = 0.065 of composite_plot).
# Figure 8 - Make a legend
legend3 <- ggplot(data = df2_amb, aes(
  x = logSF, y = absBP, group = Condition,
  fill = Condition, shape = Condition
)) +
  geom_point(size = 5, color = "white") +
  scale_fill_manual(
    values = cond_cList,
    labels = c("Condition 1 ", "Condition 2 ", "Condition 3 ")
  ) +
  scale_shape_manual(
    values = c(21, 22, 23),
    labels = c("Condition 1 ", "Condition 2 ", "Condition 3 ")
  ) +
  sm_common_legend(
    direction = "horizontal", border = FALSE,
    textRatio = 1.2, legend_spacing = 0.9
  )
composite_plot2 <- sm_add_legend(
  combined_plot = composite_plot, x = 0.85, y = 0.065,
  legend = legend3
)
Finally, we can add other types of annotations (besides
sm_add_text() and sm_add_point()) using
ggplot2 functions directly. Here, we will add two rectangles to
the composite plot using the function annotate(). The
coordinate system works similarly to how sm_add_*()
functions work; when x and y are 0.5, annotations are drawn at the
center of the composite plot. The annotate() function
requires inputs for some arguments. The first input is the type of
geom, which has to be written as 'rect' to
draw a rectangle on the plot; the coordinates of the rectangle are
specified with the xmin, xmax, ymin
and ymax arguments, all of which should be from 0 to 1.
The border color and fill color of the rectangle can also
be specified. In this example, both rectangles have no fill
color (fill = NA) but have different border colors and
different locations (set with the xmin, xmax, ymin and ymax inputs).
They span areas of multiple subplots, demonstrating that users have
full control over aesthetics. Outputs from sm_put_together()
and sm_add_legend() are treated as a single layer of
ggplot2 with normalized coordinates from 0 to 1, so users can
also use functions from third-party packages to perform particular types
of annotations.
# Figure 8 - Add annotations of shapes
composite_plot2b <- composite_plot2 + # Composite plot with legend
  annotate("rect", # Rectangle 1
    xmin = 0.23, xmax = 0.63, ymin = 0.54, ymax = 0.57, fill = NA,
    color = "#636262", linewidth = 0.8
  ) +
  annotate("rect", # Rectangle 2
    xmin = 0.56, xmax = 0.93, ymin = 0.22, ymax = 0.25,
    fill = NA, color = "#b7729a", linewidth = 0.8
  )
Lastly, the final figure is stored in the object
composite_plot2b, which is then saved as an image using
ggsave() with defined width and
height of the composite figure.
# Figure 8 - Save the composite graph as a PNG image
ggsave("composite_plot.png", composite_plot2b,
  width = 18, # inches
  height = 16.5
)
Through these examples, I have shown that the workflow for complex
data visualization in ggplot2 can be structurally linear, with
a clear beginning and end. Also, the examples have illustrated
that the limitations of how we can allocate different subsets of data to
distinct subplots and dynamically control the aesthetics are not
determined by what ggplot2 and its third-party packages are
capable of but by our own ability to apply the programmatic approach
using the lapply() function. I hope that this will empower
readers to programmatically perform subplotting in creative and
limitless ways. Readers can refer to Table 1 for the summary of the
tutorial whenever they revisit the tutorial.
Discussion
In this tutorial, I have demonstrated how smplot2 can improve the user experience for data visualization using ggplot2, in coding both standalone and composite plots. Specifically, the package can be useful for both beginners who wish to visualize their data with elegant aesthetics and advanced users who wish to structure their workflow for drawing composite figures with programmatic approaches and extend their level of customization. In the long term, the package can provide users with a flexible and programmatic approach to plotting data that could yield more diverse, expressive and powerful visualizations across different fields, including psychology and human neuroscience.
Key advantages of smplot2
The smplot2 package can provide benefits to both entry-level and advanced R users.
To begin with, a major advantage of smplot2 for incoming users, as noted by a recent review from a group of clinicians18, is that it flattens the learning curve of ggplot2 (item #1 in Table 2). The visualization functions are flexible, and their aesthetics have been optimized for the general format of scientific journals19. More than 300 reproducible examples are provided in the documentation page (https://smin95.github.io/dataviz), so users can freely use and modify this code for their own purposes. In addition, the code of the package has been reviewed for quality and stability across different computing systems by CRAN. Functions that users from diverse fields and levels of experience have used include those for raincloud plots20,21, regression analyses22,23 and forest plots24, in both standalone and composite forms.
For users with working knowledge of R and ggplot2,
smplot2 has the potential to influence how they perform complex and
sophisticated data visualizations. Specifically, it provides key
functions for them to integrate the practices of data visualization
using ggplot2 and the programmatic approach, as smplot2
overcomes the limited flexibility of aesthetics at the level of
composite figures in ggplot2. Namely, it provides a complete,
flexible and linear workflow for combining multiple ggplot2
outputs into a composite plot. It also integrates the programmatic
approach, which can generate multiple ggplot2 outputs, into the
visualization pipeline by handling different (nested) structures of list
objects from lapply() (or map()) functions or
other methods that are compatible with ggplot2. Furthermore, it
enables users to adjust marginal space and annotate both within and
across subplots in any form after a composite plot has been constructed,
encouraging users to apply the programmatic approach rather than to
create each plot separately. Therefore, it will motivate users to search
for solutions within their scripts rather than to find third-party
packages that resolve issues in plotting. In sum, the package has the
potential to empower users by allowing them to create more customizable,
dynamic and expressive figures, promoting the reproducibility of complex
visualization routines, and linearizing the workflow of visualizing a
composite plot.
Numerous packages, such as ggfortify6, ggstatsplot8 and GGally25, have been developed to allow users to easily plot data using different types of graphs in a few lines of code (shortcut functions; item #2 in Table 2), thereby extending the functionalities of ggplot2 and flattening the steep learning curve for beginners. There are also packages, such as grid, patchwork and gridExtra, that provide functions for users to create composite figures in ggplot2 in various layouts by combining discrete ggplot2 objects (item #3 in Table 2). This approach has been widely used so that users can achieve maximum flexibility in the aesthetics of the combined figure (see Figure 1). Nevertheless, these packages do not offer significant versatility after subplots (multiple ggplot2 objects) have been combined into a composite plot, thereby encouraging users to create plots separately and to shy away from applying programmatic practices (item #4 in Table 2). For instance, after multiple ggplot2 objects are combined into one figure, controlling the positions of legends and annotations, as well as the size of the margins between subplots, becomes more difficult (items #5-7 in Table 2), whereas this task can be easily performed in Python’s matplotlib. This has led users to adopt practices that go against the principles of open science, such as using vector graphics software (e.g., Adobe Illustrator) to annotate the final figure generated from R. These restrictions can now be lifted with smplot2, which linearizes the workflow for complex data visualizations (item #8 in Table 2) and elevates the level of customization for aesthetics in situations where users want to stitch multiple ggplot2 objects together to construct a composite plot.
R or Python?
The dispute about which of ggplot2 and matplotlib is better for data visualization has been ongoing for some time26. A well-known plotting package that complements matplotlib, seaborn27, has captivated a wide user base in Python with its beautiful aesthetics and shortcut functions for plotting. The two libraries (seaborn and matplotlib) embrace the programmatic approach, requiring users to apply iterations and conditional statements to plot data. Although this steepens the learning curve for users, it increases flexibility for aesthetics, allowing users to dynamically control each component of the figure. A comparable library to matplotlib in R is ggplot2, which is convenient for plotting different types of graphs without requiring users to understand programming concepts, such as loops and functional methods, primarily due to its layered approach. This simplicity, along with the fact that ggplot2 can generally reproduce figures from matplotlib with fewer lines of code, has expanded its user base rapidly (see Figure 1). However, this layered approach comes at a cost because it hinders users from controlling the aesthetics using the programmatic approach. Although ggplot2 can be superior to matplotlib in many aspects of visualization, notably for concisely plotting different types of graphs with its declarative syntax, its design can complicate the workflow for users when it comes to subplotting and creating composite figures, leaving Python’s matplotlib slightly more suitable for performing complex visualizations (see Table 2 for their comparisons).
Throughout the tutorial, I have compared Python’s matplotlib and R’s ggplot2 closely to demonstrate that the gap between ggplot2 and matplotlib has been minimized with smplot2 in the realms of subplotting and flexibility. With the arrival of smplot2, it is now possible to linearize the process of subplotting with clear starting and ending points because the package integrates the interface of ggplot2 and the programmatic approach.
Why use the programmatic approach?
So far, the programmatic approach has not been ideal in ggplot2 because it creates various ggplot2 objects that need to be joined together using other packages. Unfortunately, the level of aesthetic control decreases steeply from when a plot is built as a single ggplot2 output to when multiple outputs are combined into a composite figure, encouraging users to generate each subplot separately. However, in this tutorial, I have demonstrated the efficiency of the programmatic approach with three examples using smplot2.
There are several reasons why I support this method for plotting.
First, complex data visualizations, such as composite plots, can be
performed with concise code. Second, it increases code readability and
reproducibility because the pipeline remains very similar regardless of
the number of variables or lapply() functions. Third, it
lifts aesthetic limitations by integrating native R programming
practices with ggplot2’s declarative syntax.
For example, users can create a complex composite plot, such as a
lower triangular matrix form, without relying on external packages. They
can use the lapply() function to create empty plots at
specific panel positions, and then combine all panels using
sm_put_together(). The panels will be arranged in a
triangular layout. To create an empty plot, users can type
ggplot(NULL) + sm_common_legend() (see examples in Chapter
7 of the documentation webpage), which has no aesthetic mappings (no
legend). After creating a composite plot with
sm_put_together(), users can also add annotations of any
form at any coordinates within the composite figure using the layered
approach of ggplot2. In short, by passing the baton of what can
be done in data visualizations using ggplot2 from external
packages to the native, programmatic functions of R, smplot2
frees users to perform
sophisticated visualizations using simple solutions and the superior
features of ggplot2.
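The lower-triangular layout mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the package's documented recipe: it uses the built-in mtcars data, and it uses ggplot() + theme_void() from ggplot2 as a stand-in for the empty panel (the text above uses ggplot(NULL) + sm_common_legend() for that purpose). The names vars, is_empty and tri_plots are hypothetical.

```r
library(ggplot2)

vars <- c("mpg", "hp", "wt") # columns of the built-in mtcars data
n <- length(vars)

# TRUE where a panel should stay empty (above the diagonal)
is_empty <- outer(seq_len(n), seq_len(n), function(r, c) c > r)

tri_plots <- lapply(seq_len(n), function(iRow) {
  lapply(seq_len(n), function(iCol) {
    if (is_empty[iRow, iCol]) {
      # Upper triangle: blank placeholder panel (stand-in, see lead-in)
      ggplot() + theme_void()
    } else {
      # Lower triangle and diagonal: scatterplot of one variable pair
      ggplot(mtcars, aes(x = .data[[vars[iCol]]], y = .data[[vars[iRow]]])) +
        geom_point()
    }
  })
})

is_empty # layout of empty (TRUE) and filled (FALSE) panels
```

The nested list holds three rows of three panels in row-wise order, so it could then be passed to sm_put_together() with ncol = 3 and nrow = 3 in the same way as all_plots1 in Example 3.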
Closing remarks
In this tutorial, I have introduced smplot2, an R package that provides a structured workflow for plotting by integrating a programmatic approach, as well as visualization and layout functions for advanced data visualizations. The defaults of the plots generated by the package are simple and minimalistic, and have also been optimized for subplotting so that individual components of the figure are still clearly visible in a composite plot. Also, the functions introduce a linear process for creating a composite figure by giving users full control of aesthetics at multiple stages of plotting in ggplot2. I hope that the package can encourage more users to use R as part of their visualization routines.
Acknowledgment
This work was supported by a National Natural Science Foundation of China grant (#32350410414) and a National Foreign Expert Project (#QN2022016002L) fund. I thank smplot2 users who have given feedback and raised issues about the package since its inception. I am also grateful to Mengting Chen, Chenyan Zhou and Shiqi Zhou who tested the package numerous times during its development.